Alex Deich adeich2@illinois.edu
Entropy is a distinctly muddy subject for me, and I've always been a little confused and curious about how it works for different physical systems. I know that in physics, entropy always increases, but is this true for all systems? How easy is it to make a system whose entropy decreases? How does physics entropy compare with entropy in information theory?
Recently, I was reading about cellular automata[1](#fn1), and I wondered how the entropy of these things changed over time. After all, certain cellular automata (like Conway's Game of Life) are often touted as toy models which eerily recreate a lot of the complexity found in nature. Does the entropy of these models reflect this qualitative property? This turned out to be a poorly-posed question, and the process of making it a better question taught me a lot about cellular automata and entropy.
I dislike the saying "entropy is a measure of disorder" (why would Nature's idea of disorder align with mine?), and so I read Wikipedia for a while. It turns out that while all definitions of entropy are ultimately equivalent, each notion has a natural interpretation.
The first definition I looked at was Shannon entropy, which Claude Shannon devised for studying information theory. He was looking at strings of text arriving over a radio link, and wanted a way to quantify how much "new information" each successive character added to the string.
For example, if you had received 6 0's in a row, 000000, there's not very much information. If you were to describe this string, you would say something like "6 0's". This counts as a form of lossless compression: you've taken a message of 6 bits and given a shorter prescription for how to reproduce it exactly[2](#fn2).
Now, if the next digit is a 0, it adds very little information: the compression becomes "7 0's". However, adding a 1, 0000001, represents a relatively large amount of new information. No longer is the compression simply stating the number of 0's. Instead, it becomes "6 0's, 1 1", which is about twice as big as the original compression. Indeed, if we had a string of 0's and 1's with no discernible pattern, like 1110010, the compression is about the same size as the string itself: "3 1's, 2 0's, 1 1, 1 0".
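This kind of description can be made concrete with a toy run-length encoder (my own sketch, not any standard library):

```python
def run_length_encode(bits):
    """Compress a string of symbols into (count, symbol) runs."""
    runs = []
    for b in bits:
        if runs and runs[-1][1] == b:
            runs[-1] = (runs[-1][0] + 1, b)  # extend the current run
        else:
            runs.append((1, b))              # start a new run
    return runs

# A low-information string compresses to a short description...
print(run_length_encode("0000000"))  # [(7, '0')]
# ...while a patternless string barely compresses at all
print(run_length_encode("1110010"))  # [(3, '1'), (2, '0'), (1, '1'), (1, '0')]
```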
So how can we quantify the amount of new information that each additional bit provides?
One way to do it would be to take the fraction of the message which is a 0 and the fraction which is a 1. Call them $p_0$ and $p_1$, respectively. We want a quantity which takes these two values and returns something which tracks with our notion of the information contained in a string. One way to do this would be to simply say that the information contained in a string is $I = 1 - \max{\{p_0, p_1\}}.$
This would appear to work okay: for a string of all 0's or all 1's, it returns 0, which is a quality we're looking for. However, this and many other simple attempts to define information fail in an important regard. When we consider the information contributed by two new bits, we should expect the total information to be the sum of the two individual contributions: $I_\text{total} = I(p_1) + I(p_2)$, where $p_1$ and $p_2$ are the probabilities of the two bits. This is tricky! Considering the two events at once, we see that there are four possible ways to add two bits to the string: 00, 01, 10, and 11. In other words, the information of the two events taken together (which occur with probability $p_1 \times p_2$) needs to equal the sum of the information of the two events individually, or
$I(p_1p_2) = I(p_1) + I(p_2)$. The simplest function with this additive property is the logarithm.
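We can check numerically that the logarithm has this additive property, while the max-based guess from above does not (the probabilities here are arbitrary examples):

```python
import math

p1, p2 = 0.5, 0.25  # arbitrary example probabilities

# The max-based measure from above is not additive...
I_max = lambda p: 1 - max(p, 1 - p)
print(I_max(p1 * p2))         # 0.125
print(I_max(p1) + I_max(p2))  # 0.75 -- not the same

# ...but the logarithm is: log(p1*p2) = log(p1) + log(p2)
print(math.log(p1 * p2))
print(math.log(p1) + math.log(p2))
```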
So, our new suggestion is that the information communicated by a new bit that occurs with probability $p_i$ is $I = \log(p_i)$. (Historically, this is not how entropy was defined, but it gets us to the same place.) In the simplest case of two equally probable outcomes (probability 0.5 that the next bit is a 0 or a 1), this returns a negative number, which feels wrong: sure, we could define information to be a negative quantity, but the discussion above really wants a high-information message to have a big positive number. For this reason, we take the reciprocal of the probability inside the logarithm, which is the same as putting a minus sign in front: $\log(p_i^{-1}) = -\log(p_i)$.
Now, if we sample this quantity over $N$ new bits, each outcome $i$ occurs $n_i = N p_i$ times (in the case of a binary string, $i \in \{0,1\}$). The total information of the message is then $-\sum_i n_i \log p_i = -N \sum_i p_i \log p_i$, so the average amount of new information conveyed per bit is $$ S_\text{Sh.}= -\sum_i p_i \log{p_i}. $$ This is the Shannon entropy.
Because Shannon was thinking about bits, where $i \in \{0, 1\}$, he chose $\log_2$ so that messages with equal numbers of 0's and 1's (and therefore maximum entropy) would give a value of 1, but this is really just a scaling factor.
In Python, an implementation of this looks like
import numpy as np

def shannon_entropy(message):
    tabsize = len(message)
    S = 0
    n1 = np.count_nonzero(message)  # number of 1's in the message
    n0 = tabsize - n1               # number of 0's
    P0 = n0 / tabsize
    P1 = n1 / tabsize
    if P0 > 0:
        S += -P0 * np.log2(P0)
    if P1 > 0:
        S += -P1 * np.log2(P1)
    return S
# Low-information strings have low entropy
s1 = [1,1,1,1]
print(f"S({s1}) = {shannon_entropy(s1)}")
# Max-information strings have max entropy
s2 = [0,1,0,1,0,1]
print(f"S({s2}) = {shannon_entropy(s2)}")
To explore these ideas in cellular automata (hereafter CA), I've written a Python package I call CALab, whose source you can view. CALab has code for evolving 2D grids of 1's and 0's with periodic boundaries according to a CA rule which depends only on the current state of a grid to determine the next. There are three rules built-in, and the process for adding your own is straightforward.
What is a cellular automaton?
Cellular automata are toy models in which grids of cells can take one of two values, "alive" and "dead". The grids change in time, with their state at a given timestep determined by their state at the previous. The exact rules which are used to determine who's alive and who's dead can be very complicated, but it is often noted how surprisingly simple rules can lead to very complicated structures and phenomena. Conway's Game of Life is one example.
CALab also includes some animation code to animate a CA, along with whatever variable you want to track during its evolution. For example, here's a "glider", a classic persistent pattern in Conway's Game of Life:
from calab import *
import analysis
# calab contains a directory with a bunch of initial conditions
# stored as serialized pickle files
glider = get_ic("ics/life/glider.pkl")
updater = update_rules.game_of_life
animate_ca(glider, updater)
Conway's Game of Life (hereafter GL) works by applying the following rules to each cell in the grid:
A living cell with two or three living neighbors stays alive.
A dead cell with exactly three living neighbors becomes alive.
Every other cell dies or stays dead.
These simple rules famously give rise to surprisingly complex structures and patterns. How does the Shannon entropy evolve for various GL configurations? Let's look at the entropy of the glider. Intuitively, this should remain constant: there are always 5 living cells (represented as 1's), so the probability that any given cell is alive or dead never changes. Let's check:
animate_ca(glider,
updater,
tracking_variables=[analysis.shannon_entropy],
labels=["Shannon entropy"])
Indeed it is. What about for a more complicated pattern? Say, a random grid. How does GL's Shannon entropy evolve for that?
animate_ca(random_table((15,15)),
update_rules.game_of_life,
nframes=150,
tracking_variables=[analysis.shannon_entropy],
labels=["Shannon entropy"],
verbose=True)
So already we've found something kind of interesting: Random initial conditions tend to lose entropy in GL. This is the first indication that GL doesn't necessarily model physical systems very well. Physical systems tend to increase in entropy over time, while almost all GL configurations lose information over time[3](#fn3) (see the appendix for discussion). This is an important point. That Conway's Game of Life loses information over time implies two things:
Loss of time-reversibility. If the information at timestep $t_{n+1}$ is less than the information at $t_n$, then you cannot recreate $t_n$ given only $t_{n+1}$. Our best current understanding is that the laws of physics are time-reversible.
Entropy will tend to decrease. If entropy is a measure of new information, then entropy in GL will decrease over time, as we have seen. Another way of stating the same idea: in a system where the total information stays the same, as the system changes, the information about its previous states must go somewhere else in the system. In physics, this information is stored in positions and momenta, so the system has to occupy more and more states to hold it.
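The loss of reversibility is easy to see directly: two different grids can evolve to the same successor, so the update map cannot be inverted. Below is a minimal Game of Life step of my own (a standalone sketch on a periodic grid, independent of CALab):

```python
import numpy as np

def gol_step(grid):
    """One Game of Life step on a periodic 2D grid of 0's and 1's.
    (A minimal standalone implementation, not CALab's code.)"""
    # Count live neighbors by summing the 8 shifted copies of the grid
    n = sum(np.roll(np.roll(grid, di, 0), dj, 1)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0))
    # Birth on 3 neighbors; survival on 2 or 3
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(int)

empty = np.zeros((5, 5), dtype=int)
lone = np.zeros((5, 5), dtype=int)
lone[2, 2] = 1  # a single live cell dies of loneliness

# Two different states map to the same (empty) successor,
# so the map cannot be inverted: information is lost.
print(np.array_equal(gol_step(empty), gol_step(lone)))  # True
```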
Finally, let's look at the following initial condition, where all of the living cells are grouped together in the middle. This is kind of like a confined gas: In the real world, these living cells would like to diffuse through the rest of the box. This is a classic example of increasing entropy.
circle = get_ic("ics/circle.pkl")
animate_ca(circle,
updater,
nframes=40,
tracking_variables=[analysis.shannon_entropy],
labels=["Shannon entropy"])
A rapid loss of entropy! Very bizarre.
So, let's instead look at what happens when we define a new CA rule, one which keeps the amount of information constant (it's said to "conserve information"): The random walk.
The random walk CA iterates over each cell in the grid. At every timestep, it executes these rules:
If the cell is alive, choose a random direction and move the cell one step that way (if the destination cell is empty).
This is a reasonable model of Brownian motion, which describes how molecules of gases move.
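The exact update rule lives in CALab's `update_rules` module; here is a standalone sketch of the idea (my own code, with the assumption that a particle only moves into empty cells, so the particle count is conserved):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_walk_step(grid):
    """One random-walk step on a periodic grid: each live cell picks
    a random direction and moves there only if the target is empty,
    so the particle count is conserved. (A hypothetical standalone
    version, not CALab's actual implementation.)"""
    new = grid.copy()
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    rows, cols = grid.shape
    for i, j in zip(*np.nonzero(grid)):
        di, dj = moves[rng.integers(4)]
        ti, tj = (i + di) % rows, (j + dj) % cols
        if new[ti, tj] == 0:  # only move into empty cells
            new[i, j], new[ti, tj] = 0, 1
    return new

grid = np.zeros((10, 10), dtype=int)
grid[4:6, 4:6] = 1  # four clumped "particles"
stepped = random_walk_step(grid)
print(stepped.sum())  # still 4: particle count (and Shannon entropy) unchanged
```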
Let's look at how the Shannon entropy of a random walk CA evolves for an initial condition where all of the particles start out clumped together:
circle = get_ic("ics/circle.pkl")
updater = update_rules.random_walk
animate_ca(circle,
updater,
nframes=20,
tracking_variables=[analysis.shannon_entropy],
labels=["Shannon entropy"],
verbose = True)
The Shannon entropy is constant! We should have expected this: After all, like the glider example, the number of particles (and the amount of information) is staying constant. But then why does a diffusing system in real life grow monotonically in entropy? To see why, we need to talk about entropy as it is understood in statistical mechanics.
At first glance, the new definition of entropy is identical to before:
$$ S_\text{SM} = - \sum_i p_i \log p_i. $$
However, what the $p_i$ represent and how we calculate them is now quite different. Rather than asking for the probability that the next bit in a message takes a certain value, we're asking for the probability of a given "system" being in a certain "state".
What are systems and states?
A system is a collection of things that are subject to the laws of physics. Two billiard balls constitute a system, as do a bunch of planets orbiting a star, or a single atom. All that matters is that you define what exactly is in the system, so we can be precise about it. A state is a configuration of the system. Where are the planets in their orbits? How fast are they going? Are the billiard balls colliding? What are their momenta?
In physics, different states often require different amounts of energy to reach them. For example, consider a system of a single basketball near the surface of the Earth. In one state, the ball is one meter above the ground. In another state, the ball is 20 meters above the ground. The second state requires much more energy than the first.
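With rough numbers (a basketball is about 0.6 kg, and $E = mgh$ near the surface; these figures are my own illustrative estimates, not from the text):

```python
m, g = 0.6, 9.8    # basketball mass (kg), gravitational acceleration (m/s^2)

E1 = m * g * 1.0   # state 1: ball 1 m above the ground
E2 = m * g * 20.0  # state 2: ball 20 m above the ground

print(E1, E2)      # roughly 5.9 J vs 117.6 J
print(E2 / E1)     # the second state costs 20x the energy
```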
Statistical mechanics says that the probability $p_i$ of a system being found in a given state is related to the energy $E_i$ required to get it in that state[4](#fn4) by
$$ p_i \sim e^{-E_i}. $$
So if you have a huge collection of gas around a planet, you're much more likely to find the gas near the surface of the planet than far away. This explains why atmospheric density drops off the way it does. Now, notice that the $i$ is no longer counting the two possibilities from before (0 or 1), but is now a potentially gigantic number. The number of possible altitudes a molecule can have above the Earth's surface is uncountably[5](#fn5) infinite!
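As a toy numerical illustration of these Boltzmann weights (with made-up energies in arbitrary units, and the temperature factor absorbed into $E_i$):

```python
import numpy as np

# Made-up energies of four states, in arbitrary units
E = np.array([0.0, 1.0, 2.0, 5.0])

# Unnormalized Boltzmann weights, then normalize to probabilities
p = np.exp(-E)
p /= p.sum()

for Ei, pi in zip(E, p):
    print(f"E = {Ei}: p = {pi:.3f}")
# Low-energy states dominate: the higher the energy,
# the exponentially rarer the state.
```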
Likewise, if your system is a bunch of gas in a box, then there are vastly more configurations in which the gas is spread evenly through the box than configurations in which it is concentrated in a corner. Thus, the gas tends to diffuse.
So we want to come up with a way to assign probabilities to cells in any CA such that the entropy will evolve like we expect for physical systems. In particular, we want a diffusing system (like modeled above with the random walk CA) to increase in entropy, rather than stay constant, as Shannon entropy dictates.
To do this, I defined something I call "nearest neighbor" (NN) likelihood, which assigns to every cell a probability that it will become alive or dead in the next timestep. This is accomplished simply by looking at how many of that cell's neighbors are alive: if many of them are, then it has a high chance of being alive in the next generation. If none are, then it has no chance. This creates a sort of "potential energy" for each cell, which is a function of the number of living neighbor cells.
Of course, this is a naïve method: GL, after all, will kill any cell with 8 living neighbors, while a random walk won't. However, this is not an important detail; one can change the specific probabilities however one pleases. The important thing is to assign those probabilities based on the cell's immediate neighbors, rather than through a global calculation, as Shannon entropy does.
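The real `nn_entropy` lives in CALab's `analysis` module; below is a rough reconstruction of the idea (my own sketch, unnormalized, with the per-cell probability simply taken as the fraction of living neighbors):

```python
import numpy as np

def nn_entropy_sketch(grid):
    """Sum of per-cell entropies, where each cell's probability of
    being alive is the fraction of its 8 neighbors that are alive,
    on a periodic grid. (A reconstruction of the idea, not CALab's
    exact implementation, and without the checkerboard scaling.)"""
    # Number of live neighbors for each cell (8-neighborhood)
    n = sum(np.roll(np.roll(grid, di, 0), dj, 1)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0))
    p = n / 8.0                  # "probability of being alive next step"
    S = np.zeros_like(p, dtype=float)
    for q in (p, 1 - p):         # both outcomes contribute
        mask = q > 0
        S[mask] -= q[mask] * np.log2(q[mask])
    return S.sum()

uniform = np.ones((8, 8), dtype=int)     # every p is 1 -> zero entropy
checker = np.indices((8, 8)).sum(0) % 2  # every p is 0.5 -> maximum
print(nn_entropy_sketch(uniform), nn_entropy_sketch(checker))
```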
Additionally, because I want to compare NN entropy to Shannon, I scale the NN entropy by its maximum possible value, which is the value yielded for a checkerboard. For this reason, Shannon entropy will always be an upper bound on NN entropy.
Let's see how this new definition of entropy compares to the Shannon entropy in a few examples.
import numpy as np
import matplotlib.pyplot as plt

# first, a sanity check. Does a constant grid give 0 entropy?
a = np.ones((8,8))
plt.imshow(a, origin='lower', cmap='gist_yarg')
print(f"nearest neighbor entropy: \n{analysis.nn_entropy(a)}")
# Now, how about a grid with a large concentration of 1's?
a = np.zeros((8,8))
a[:,:4] = 1
plt.imshow(a, origin='lower', cmap='gist_yarg')
print(f"nearest neighbor entropy: \n{analysis.nn_entropy(a)}\n")
print(f"shannon entropy: \n{analysis.shannon_entropy(a)}")
# Now, a maximum entropy situation: alternating 1's and 0's
a = np.zeros((8,8))
a[1::2,::2] = 1
a[::2,1::2] = 1
plt.imshow(a, origin='lower', cmap='gist_yarg')
print(f"nearest neighbor entropy: \n{analysis.nn_entropy(a)}\n")
print(f"shannon entropy: \n{analysis.shannon_entropy(a)}")
# Now, let's look at how it does the random walk CA from before:
circle = get_ic("ics/circle.pkl")
updater = update_rules.random_walk
animate_ca(circle,
updater,
nframes=500,
tracking_variables=[analysis.shannon_entropy,
analysis.nn_entropy],
labels=["Shannon entropy",
"Nearest neighbor entropy"],
verbose = True)